Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery

Background: Drug discovery and development is the extremely costly and time-consuming process of identifying new molecules that can interact with a biomarker target to interrupt the disease pathway of interest. In addition to binding the target, a drug candidate needs to satisfy multiple properties affecting absorption, distribution, metabolism, excretion, and toxicity (ADMET). Artificial intelligence approaches provide an opportunity to improve each step of the drug discovery and development process, in which the first question faced by us is how a molecule can be informatively represented such that the in-silico solutions are optimized. Results: This study introduces a novel hybrid SMILES-fragment tokenization method, coupled with two pre-training strategies, utilizing a Transformer-based model. We investigate the efficacy of hybrid tokenization in improving the performance of ADMET prediction tasks. Our approach leverages MTL-BERT, an encoder-only Transformer model that achieves state-of-the-art ADMET predictions, and contrasts the standard SMILES tokenization with our hybrid method across a spectrum of fragment library cutoffs. Conclusion: The findings reveal that while an excess of fragments can impede performance, using hybrid tokenization with high frequency fragments enhances results beyond the base SMILES tokenization. This advancement underscores the potential of integrating fragment- and character-level molecular features within the training of Transformer models for ADMET property prediction.


Introduction
Drug design has evolved from serendipitous screening of natural compounds to an increasingly rational and data-driven approach, focusing on the molecular structure and mechanisms behind disease-related targets [1].The application of artificial intelligence (AI), particularly machine learning (ML), has revolutionized the pharmaceutical field, which is able to take advantage of the vast arrays of biomedical data that has been gathered [2].AI and ML contribute to various aspects of drug design, including predicting pharmacokinetic and pharmacodynamic properties, identifying binding sites on a given biomolecular target, repurposing drugs, and creating new molecules with desired characteristics, all of which reduce the time and cost associated with developing effective and safe medications [3][4][5].Furthermore, absorption, distribution, metabolism, excretion, and toxicity (ADMET) are crucial in evaluating drug post-administration behaviour, and in minimizing clinical trial failures [6,7].Despite challenges, such as data scarcity and complex molecular structures in the area of ADMET prediction, ML techniques have been able to extrapolate structural patterns that implicate molecular properties, and circumvent the need for costly assays during a large-scale screening process.As a result, ML plays a significant role in the identification and early exclusion of unsuitable compounds, mitigating financial burdens from unsuccessful ventures in the drug development cycle.
In the field of computational chemistry, molecular structures can be represented through various formats.Line notations, such as Simplified Molecular Input Line Entry System (SMILES) [8] provide a textual method to describe the structure of chemical entities, including molecular information such as C for carbon, = for a double bond, parentheses for branches, and @, /, /, for stereochemistry.As an example, climbazole is represented as CC(C)(C)C(=O)C(N1C=CN=C1)OC2=CC=C(C=C2)Cl.Despite the widespread use of SMILES, its strict syntactical guidelines often result in the production of numerous invalid molecular structures, leading to the development of other line notations such as DeepSMILES [9], SELF-Referencing Embedded String (SELFIES) [10], and Group SELFIES [11] which mitigate some of the mentioned issues.However, they are not as widely supported as SMILES and may necessitate larger alphabets.
Fragmentation is another approach to representing a molecule where a large molecule is broken apart into smaller pieces [12].The fragmentation process can reveal important structural and functional features of the original molecule that are not easily discernible from an atomic-level representation such as SMILES.For example, fragmentation can generate sub-molecules that contain specific functional groups or motifs that are relevant for physicochemical properties.However, fragmentation is complex due to the variety of methods and criteria involved in bond cleavage and the selection of submolecular entities [13].Additionally, fragmentation presents several challenges, such as producing sub-molecules that are unusually large or small, and the formation of a vast library of fragments that appear with varying frequencies, with a significant number rarely occurring.
To the best of our knowledge, there exists no direct comparison between fragment and atom-based models for ADMET prediction using the same model.In this study, we construct a hybrid fragment-SMILES encoding technique to combine the advantages of both representations for use in machine language models.As illustrated in Fig. 1, there are a large amount of fragments that occur infrequently.Thus, we construct various models with varying frequency cutoffs to produce a fragment spectrum of models, and perform a fair comparative investigation between SMILES and a fragment spectrum using the hybrid encoding technique for ADMET prediction with a Transformer architecture.Moreover, we also experiment with two pre-training techniques which we denote onephase and two-phase.
The rest of this study is organized as follows.Section "Related works" discusses related works, giving information on the use of Transformers for ADMET prediction, and graph-based neural networks for ADMET prediction.Section "Methodology" discusses the methods used within this work, describing the model and the encoding approach.Afterwards, Section "Experimentation" illustrates all the necessary information for replicating the experiments performed in this study.Following is Section "Results and discussions" which displays and investigates the results of our experimentation.Lastly, concluding remarks are made in Section "Conclusion".

Transformer-based ADMET models
Language models are a class of deep learning models that learn the semantic and syntactic patterns of natural language from a large corpora of text.They can also be applied to molecular sequences, such as SMILES strings, to capture the structural and functional features of molecules.Before the advent of Transformer models, recurrent neural networks (RNNs) were commonly used for language modelling tasks, and also for tasks within ADMET prediction [14,15].One of the advantages of language models is their ability to leverage pre-trained weights in an unsupervised or semi-supervised fashion over a general domain, and then fine-tune for downstream tasks such as ADMET prediction.This process, known as transfer learning, helps improve performance and generalization capabilities by reducing the risk of overfitting and increasing the diversity of molecules for syntactic and semantic understanding.Transfer learning is particularly useful when the data is scarce or noisy, such as is frequent in ADMET prediction where gathering data is costly.
Attention mechanisms are not novel, as they have been applied in RNNs before the birth of Transformers [16,17], however the Transformer architecture emphasizes self-attention to focus on the important sequence sections and capture long-range dependencies and relationships among them [18].As Transformers have shown superior performance over RNNs for natural language processing (NLP) tasks, they too Fig. 1 Fragment library proportions by observation frequency.An integer in the brackets is a threshold.The corresponding percentage represents the portion of fragments above the threshold have become a focus of research towards ADMET prediction.Many of the constructed ADMET Transformer models make use of those popularized in NLP literature, such as but not limited to BERT, RoBERTa, and GPT-2 [19][20][21][22][23][24][25][26][27][28][29].Others combine graph representations with the Transformer to obtain graph-level contextual understanding [30][31][32][33][34][35][36].In addition, some works use a combination of molecular line notation and pre-fabricated descriptors [25], while the remaining use the Transformer with various training strategies and architectural changes [37][38][39][40].Despite the application of various modelling strategies, including pre-training techniques and attention mechanisms for ADMET prediction, a common thread in prior research is the use of SMILES or molecular graph representations.This differs from our investigation which utilizes a hybridized fragment and SMILES encoding using the MTL-BERT model [41].
Although Transformer models have been proposed for ADMET prediction, they face several challenges and limitations, such as data availability, data quality, model interpretability, and robustness.Both data availability and quality are crucial for training accurate and reliable ADMET prediction models, however many ADMET datasets are class imbalanced and can at times be inconsistent.In addition, datasets are often imbalanced in terms of sample quantity when combined for training purposes, and improper sampling strategies are likely to cause catastrophic forgetting among tasks when considering a multi-task approach [42].Model interpretability and robustness are important for understanding the rationale behind predictions and ensuring their applicability in varying scenarios.Equally as important, Transformer models must also be computationally efficient and scalable to handle large-scale and complex molecular datasets.This becomes important with the rise of foundational models in NLP literature, which has in turn spilled over into the cheminformatics domain [37].

Graph-based ADMET models
Graph-based neural network (GNN) models are a popular and effective way of leveraging information gained from using graph-structured data such as molecules [43].A graph is a collection of nodes and edges, where nodes are the building blocks, such as atoms, and edges are connections between entities, like bonds.GNNs learn meaningful representations of molecular structures and properties by aggregating information from local neighbourhoods through different operations such as message passing or convolution.Attention may also be included in GNNs to determine the important neighbouring node features to aggregate.Afterwards, node features constructed by the model can then be combined to obtain a feature vector for the whole graph, which is in turn used for downstream tasks such as ADMET prediction.GNNs have been widely used in ADMET prediction as they can capture the structural and chemical properties of molecules similar to SMILES, and similarly to Transformer models, have the ability to leverage structural information of molecules through transfer learning [44,45].
GNNs applied to ADMET prediction can be broadly categorized into four groups: graph convolutional neural networks (GCNs) [43], graph attention networks (GATs) [46], message passing neural networks (MPNNs) [47], and graph isomorphism networks (GINs) [48].GCNs apply convolution operations on the graph nodes to learn node embeddings, which are then pooled to obtain a graph-level representation.MT-PotentialNet [49], Weave [50], and other models [51,52], are notable works that fall into this division.GATs use attention mechanisms to assign varying weights to neighbouring nodes and edges, allowing the model to focus on the most relevant parts of the graph.ADMETLab 2.0 [53], AttentiveFP [54], and GASA [55] are examples of GATs for ADMET prediction.MPNNs use a message passing scheme to propagate information across the graph, where each node sends and receives messages from its neighbours, and then updates its own hidden state accordingly.In this regard there are models like D-MPNN [56], GeomGCL [57], and MGSSL [58].Lastly, GINs generalize the Weisfeiler-Lehman graph isomorphism test and learn node embeddings by aggregating and transforming features of neighbouring nodes with learnable parameters, afterwards pooled for a graph-level representation.This powerful GNN variant is seen represented in MolGIN [59].It should also be noted that many GNN models for ADMET prediction use a hybridization of the various categories to improve performance [60][61][62][63][64], including multi-task learning to leverage information from multiple ADMET datasets, some of which may have a low amount of samples.As is discussed in [65], many works using graph-based neural network models consider 2D chemical topology, but disregard geometrical data that provides useful information when predicting molecular properties.

Hybrid fragment-SMILES tokenization
Prior to inputting a molecule into a Transformer model, it must undergo tokenization and be encoded into a numerical representation.We propose a novel tokenization procedure, named hybrid fragment-SMILES tokenization (HFST), that incorporates both fragments and SMILES, where SMILES fragments are generated using the method described in HierVAE [66] and DeepFMPO [12].Following their technique, bonds connected away from a ring atom are broken, and an attachment point is inserted for later molecule reconstruction.To encode a molecule using the hybrid method, we first fragment and loop through all fragments.If a fragment is in the vocabulary, we use a single numerical value for encoding.Otherwise, we encode the fragment using the SMILES atomic-level representation.If two or more successive fragments are encoded using SMILES, a separator tag is placed in-between them to designate the ending of one fragment and the start of another.
The advantages of using the HFST encoding over the standard SMILES encoding are fourfold.(1) It overcomes the issue of low-frequent fragments in training.From Fig. 1, we can see that the frequency of fragments follows a power-low distribution.Low-frequent fragments dominate the over-sized vocabulary and thus lead to poor contextual embedding.(2) It solves the problem of new fragments in inference.In predictive tasks, some fragments of new molecules may not exist in the vocabulary of fragments extracted from the training data.Using an unknown token leads to information missing.(3) The fragment spectrum perspective unifies fragment-based and SMILES-based tokenizations, allowing us to select a cutoff that takes benefits of both fragment and SMILES representations.Fragment-based (no cutoff ) and SMILES (all cutoff ) are extreme cases of the HFST representation.(4) It can reduce the sequence length and thus reduce some computational complexity of the Transformer model, which is quadratic with length of input sequence.Our HFST method is illustrated in Fig. 2, where a singular molecule on the left is first fragmented, as seen by the colours blue, red, yellow, and green.Afterwards during encoding, the fragments shaded as blue, yellow and green were found in the vocabulary and so take the form of a single numerical value.However, the red fragment is encoded as a SMILES string as it was not found in the vocabulary, and thus will be encoded as many numerical values, one for each SMILES token.

Transformer ADMET model
We adopt the MTL-BERT model originally proposed by Zhang et al. for predicting ADMET properties from SMILES strings [41].The original MTL-BERT is depicted in Fig. 3.It uses a multi-task learning framework based on BERT, a Transformer-based model that learns bidirectional representations from large-scale unlabelled data.Similarly to BERT, MTL-BERT uses transfer learning which consists of two parts.In the pre-training phase, a masked language model objective is used to learn the contextual information of SMILES sequences from a large corpus of unlabelled molecules.Unlike the BERT model, next sequence prediction is not included as a pre-training objective.Afterwards, the pre-trained MTL-BERT model is fine-tuned on multiple downstream prediction tasks simultaneously with the inclusion of multiple task-specific tokens and prediction layers.The prepending of multiple task-specific tokens to a sequence, one for each predictive task, differs from the original BERT model which prepends a singular token.In the original work by Zhang et al., SMILES enumeration was used as a data augmentation technique to increase diversity, however this is not included in our work due to the generation of rare fragments not present in our curated library.
MTL-BERT is selected as the model for this study as it leverages large-scale unlabelled data to learn contextual information about SMILES strings during pre-training, which has been shown in previously mentioned studies to improve performance in downstream tasks.In addition, as MTL-BERT is inherently a multi-task model, it can benefit from sharing information amongst multiple tasks, enhancing the generalization capability of the model.Multi-task learning is particularly useful for ADMET prediction as there are numerous predictive tasks, many of which have a low amount of samples.Furthermore, as reported in [41], MTL-BERT outperforms the multi-task graph attention (MGA) framework [53] for the same ADMET tasks.We were unable to set up the MGA framework because the code from https:// github.com/ wzxxxx/ MGA is incomplete and lacks of important details.However, we believe that adopting MTL-BERT for the HFST representation in our study is sufficient due to MTL-BERT's reported superior performance.

Experimentation
This section outlines our experimental procedures to assess the performance of the proposed HFST method for ADMET prediction.First, we describe the data used for training our models.This is followed by examining the alteration of fragment vocabulary size through various frequency thresholds, resulting in a spectrum of fragments.Afterwards, the hyperparameters utilized during training to ensure replicability are presented.Additionally, two strategies for pre-training the Transformer model on large-scale unlabelled data are introduced.Lastly, the description of the metrics and methodologies used to ensure an equitable evaluation of the models is given.

Data and preprocessing
We train our model with transfer learning, which segregates the training process into two parts: pre-training and fine-tuning.During the pre-training stage, we use a large collection of unlabelled molecules to train our Transformer models.This ensures the model acquires a generalized representation of molecular structures through self-supervised learning, specifically with masked language modelling.The pre-training data consists of molecules from ChEMBL [67], MOSES [68], and ZINC-250K [14] datasets, where canonical duplicates are removed and SMILES strings above 100 tokens discarded.In total, the dataset comprises roughly 4 million molecules, which is divided into a random 80-20 train-test split.
In the fine-tuning stage, a variety of smaller datasets are utilized to fine-tune the pretrained network, enabling it to concurrently predict 29 ADMET properties through multi-task learning.The fine-tuning data consists of molecules with experimentally measured ADMET values from different sources, having a combined size of 108,315 samples.In line with our pre-training data, SMILES sequences that surpass 100 tokens are removed.Table 1 illustrates the various ADMET datasets with their accompanying size, task type, and ADMET category.All ADMET datasets for fine-tuning were obtained from Therapeutics Data Commons (TDC) [69].

Fragment spectrum
To apply the hybrid fragment-SMILES encoding of molecules, we constructed two vocabularies: one for SMILES and one for fragments.These vocabularies are derived exclusively from the pre-processed pre-training data, as described in the previous section.We use the RDKit Python package [70] to tokenize SMILES strings and the fragmentation technique from HierVAE [66] and DeepFMPO [12] to generate the fragments.Figure 1 illustrates the proportional distribution of fragment frequencies within the pre-training dataset, highlighting that a majority of fragments are uncommon, with In this study, we explore the impact of different fragment frequency thresholds ranging from 2 to 1000, as well as the absence of any threshold, on the efficacy of a Transformer model in predicting ADMET properties using our hybrid tokenization method.

Model hyperparameters
As mentioned previously, we adopt the MTL-BERT model proposed in [41] as the backbone of our experimental framework.In their work, Zhang et al. categorized hyperparameter values by small, medium, and large, where it was reported that the medium parameter size achieved a good balance between predictive performance and computational efficiency.Therefore, we follow their settings and use the medium hyperparameters for our model, as shown in Table 2. Specifically, our model has a hidden size of 256, 8 encoder layers, 8 attention heads, a dropout rate of 0.1, and a feedforward dimension of 1024.We use the Adam optimizer with a learning rate of 1e −4 , betas 0.9 and 0.98, and cross entropy loss to pre-train our model on a large cor- pus of molecules.Then, we fine-tune our model on the task-specific datasets using the AdamW optimizer with a learning rate of 0.5e −4 , the same beta values as in pre- training, mean squared error loss for regression tasks, and binary cross entropy loss for classification tasks.
For both pre-training and fine-tuning, we set the batch size to 64.To monitor the training progress and avoid overfitting, we conduct a testing epoch every 5000 training batches during pre-training and stop the training process if there is no improvement in the testing loss for two consecutive epochs.For fine-tuning, we perform a testing epoch after every training epoch and terminate if the testing loss increases two epochs in a row.In the pre-training stage, 15% of tokens are chosen at random.Of these, there is an 80% probability that a token is substituted with a mask token, a 10% probability of alteration to a random token, and a 10% probability that it will remain unchanged.

One-phase and two-phase pre-training
We experiment with two different pre-training strategies for our Transformer model: onephase and two-phase.In both strategies, we use a large corpus of unlabelled molecular structures as the pre-training data, accompanied with masked language modelling objectives.In one-phase pre-training, the Transformer model is pre-trained using the hybrid fragment-SMILES encoding.This strategy allows the model to directly learn the hybrid encoding without any intermediate steps.In two-phase pre-training, the Transformer model is pre-trained first on the SMILES encoding until no further performance improvement, and then afterwards on hybrid fragment-SMILES encoding until completion.The two-phase approach is designed to capitalize on the insights gained from SMILES encoding before learning the hybrid SMILES-fragment encoding.We hypothesize that the inclusion of low-frequency fragments in the fragment vocabulary may result in reduced visibility of SMILES tokens.Thus, two-phase pre-training allows the model to gradually adapt to the hybrid encoding while preserving the knowledge learned from SMILES embeddings.After pre-training, we perform fine-tuning on the various ADMET datasets using the pre-trained model.

Evaluation
We evaluate the performance of our Transformer model under three scenarios: pre-training (section Pre-training), fine-tuning on 29 ADMET datasets (section Fine-tuning for ADMET prediction), and fine-tuning on the ADMET group benchmarks from TDC (section Fine-tuning on therapeutics data commons ADMET benchmark).For pre-training, we compare model performance by utilizing testing loss and accuracy.For fine-tuning on the 29 ADMET datasets, we compare using area under the receiver operating characteristic curve (AUROC) on classification tasks, and the coefficient of determination ( R 2 ) on regression tasks, both from the testing set, as is indicated in Table 1.For fine-tuning on the ADMET group benchmarks from TDC, various evaluation methods are employed, including mean average error (MAE), AUROC, Spearman's rank correlation coefficient (Spearman), area under the precision-recall curve (AUPRC), each of which are identified, along with its corresponding dataset, in Table 6.Similarly, we report the testing set performance on benchmark datasets.Since the combination of ADMET datasets used in this study are imbalanced with sample size, we adopt a stratified batching strategy during fine-tuning on the 29 ADMET datasets and benchmark datasets, ensuring that each batch contains at least one sample from each dataset.By adopting this approach, we prevent the model from overfitting to larger datasets where samples may be overrepresented in batches, thereby enhancing model generalization.To mitigate the impact of data partitioning on model performance variability, we repeat the entire training procedure 5 times using distinct random seeds specified in Table 2.We also implement fivefold cross-validation when fine-tuning with the 29 ADMET datasets and present both the mean and the standard deviation of the performance metrics across all folds and the 5 training runs.For fine-tuning on the TDC benchmark datasets, a 70-10-20 train-validation-test scaffold split [71] is performed, with 5 training runs, along with expression of the mean and standard deviation of all performance metrics.

Pre-training
The results of all pre-training experiments in the final testing epoch, with mean and standard deviation among the five executions, is shown in Table 3. Indicated is a negative correlation between the frequency cutoff and both loss and accuracy.The accuracy in the pre-training stage is calculated for the prediction of masked tokens.This implies that MTL-BERT has more difficulty in predicting the masked fragment tokens when infrequent, and diverse fragments are used in place of SMILES.With a reduction in the cutoff, there is a swift rise in the number of fragments, as depicted in Fig. 1.A decrease in frequency cutoff is likely to lead to lower accuracy due to the increased presence of rarely occurring fragments, which the model may not effectively contextualize.The performance difference between the one-phase and two-phase pre-training strategies is also observed.
Apart from the outcomes at the 1000 frequency level, the two-phase approach invariably results in a slightly higher loss and improved accuracy.This may be due to the difficulties in contextualizing a hybrid sequence input that starts with SMILES tokenization and subsequently employs hybrid-fragment tokenization.In contrast, the one-phase strategy teaches the model to contextualize both fragments and SMILES simultaneously from the start.Hence, two-phase models might have to recalibrate their SMILES contextualization alongside integrating fragment information, which can result in a diminished overall performance compared to the one-phase pre-training approach.It is important to note that the comparison of pre-training performance across different cutoffs and pure SMILES, as shown by the testing loss and accuracy in Table 3, is biased, as the varying sizes of the vocabularies indicate differing degrees of challenge.The average pre-training curves of hybrid tokenization models on the testing set are illustrated in Fig. 4, separated by one-phase and two-phase.The results show a rapid improvement in the first epoch, which is characterized by a steep learning curve.This is followed by a more gradual progression in subsequent epochs, with less improvement but still evident.While the SMILES performance pattern is not illustrated, it mirrors the hybrid models with a significant initial improvement, although demonstrating lower loss values and higher accuracy rates at the end of training.The observed trends raise questions about the optimal configuration of the learning rate.Rapid early improvements hint at a robust initial grasp of data representation learning, however the plateau in later stages implies a potential overfitting or inability to further generalize from the training data.Adjusting the learning rate could help the model learn more effectively throughout the pre-training process, however we did not do so due to the high computational demands of the pre-training phase, in conjunction with the need to tune the learning rate according to each fragment frequency cutoff.
Furthermore, the consistent, although marginal, gains after the first epoch suggest that the models are still extracting valuable information at a reduced rate.This could imply that the models are approaching their capacity for learning from the given dataset, or that the complexity of the data requires more nuanced learning strategies.Further improvements to modelling could be found from an increased, and complex dataset of molecules, altering the masking rates of the masked language modelling strategy, or employing different learning strategy than masked language modelling.In summary, in this section, we investigated the performance of our HFST strategy in masked model pre-training, finding it to be a challenging task as more infrequent fragments are included in the vocabulary.In the context of the testing loss metric, it is observed that the performance marginally declines with the application of two-phase pre-training compared to a single phase approach.This reduction in performance may Fig. 4 Pre-training curves in one-phase and two-phase strategies, averaged among executions be attributed to the necessity of recontextualizing the embeddings upon transitioning from the initial to the subsequent phase.Unlike two-phase, the one-phase approach employs our proposed HFST method from the outset, thereby averting the need for recontextualization.While the one-phase approach demonstrates a preferable performance compared to the two-phase approach during pre-training, we evaluate the efficacy of our proposed method for ADMET prediction in the subsequent section.

Fine-tuning for ADMET prediction
The performance of the hybrid and SMILES tokenization models during the final testing epoch, averaged across multiple runs and folds, is presented in Figs. 5 and 6.Additionally, Tables 4 and 5 state the resultant values, categorized by the two-phase and one-phase strategies.
In the two-phase approach, we observed that SMILES tokenization consistently achieved the best performance, followed closely by the 1000 frequency hybrid tokenization, with worse metric values as more infrequent fragments are incorporated.With rare fragments included, the model fails to effectively contextualize rare fragments and accurately predict ADMET properties.Interestingly, the 1000 frequency hybrid approach outperformed SMILES specifically in CYP2D6 substrate classification and hERG regression tasks, both of which are critical for drug metabolism and safety within the cardiovascular system.
When using one-phase pre-training, both SMILES and 1000 frequency hybrid tokenizations emerge as top performers, with a trend of lower performance for lower frequency cutoffs persisting between the one-phase and two-phase strategies.Notably, the one-phase 1000 frequency hybrid tokenization consistently outperforms SMILES across most tasks, with an exception in the blood-brain barrier predictive task, where 10 frequency matched SMILES and 1000 frequency hybrid.As well, 500 frequency hybrid outpaces all remaining frequencies and SMILES for microsome clearance prediction.This suggests that specific tasks benefit from varying tokenization strategies, as certain molecular substructures carry high predictive weight.
Overall, our models demonstrated better performance on classification tasks than on regression tasks.The binary nature of the ADMET classification tasks makes them inherently easier to predict than regression tasks, which take on continuous values.Both the hybrid and SMILES tokenization model exhibited poor performance on the half-life dataset, with suboptimal average prediction and a high standard deviation, as seen in Figs. 5 and 6.This dataset likely poses unique challenges due to the complex interplay of molecular features that affect drug half-life, some of which may not be directly related to the molecule itself.Despite the poor performance overall, one-phase 1000 frequency hybrid tokenization still outperformed SMILES on the half-life task, suggesting that the hybrid approach offers resilience in challenging predictive tasks.

Fine-tuning on therapeutics data commons ADMET benchmark
We fine-tune our HFST model and SMILES model utilizing the ADMET group benchmark from TDC, encompassing a total of 22 datasets.The mean and standard deviation test set performance of the 1000 frequency one-phase HFST and SMILES tokenization models is presented in Table 6, provided with the corresponding performance metrics as explained in Sect.4.5, and highlighting of best metric values.Furthermore, we provide a comparative analysis between our model and five non-ensemble models that are prominently featured on the leaderboard, having submitted entries across all benchmark datasets.This comparison spans a diverse array of machine learning methodologies, ranging from conventional ADMET prediction employing molecular fingerprints and multilayer perceptron (MLP) models to contemporary deep learning methods, including convolutional neural networks (CNNs) and GNNs.The models used for comparison include Basic ML [72], DeepPurpose (with variants Morgan + MLP and CNN, each executed separately) [73], Chemprop (a message passing GNN model) [74], and AttentiveFP (a GAT model) [54].
The results in Table 6 reveal HFST demonstrating superior performance over the traditional SMILES notation for molecular language modelling in a majority of the benchmark tasks, as corroborated by the findings in Sect.5.2.Notably, the one-phase 1000 frequency HFST excels in predicting bioavailability and hepatocyte clearance, while SMILES tokenization shows its strengths in CYP 2C9 substrate and microsome clearance tasks.This performance underscores the potential of HFST in certain ADMET applications.Extending beyond our approach, no single model dominates across all 22 benchmark datasets, suggesting that a tailored approach of selecting specific models for specific tasks may yield the most effective strategy in predictive modeling for drug discovery.This observation aligns with the current absence of language models on the TDC leaderboard and the prevalence of GNNs, indicating a need for enhancements in training Transformer models to establish their competitive edge in this domain.The results collectively signal an opportunity for the development of more robust models capable of consistent performance across a diverse array of ADMET prediction tasks.

Conclusion
In this study, we explore the impact of a novel hybrid fragment-SMILES tokenization (HFST) procedure alongside two pre-training strategies for Transformer-based ADMET prediction, while experimenting with a spectrum of fragment vocabularies.Our findings underscore the critical role of data representation and learning methodology in achieving accurate predictions for classification and regression tasks.Although SMILES tokenization remains a robust baseline, our hybrid approach, especially at the 1000 frequency level, consistently outperforms SMILES tokenization, under both a collection of 29 ADMET datasets and the TDC ADMET group benchmark.However, it is important to recognize that the selection of a frequency cutoff significantly impacts model performance, and incorporating lower frequency fragments tends to have a detrimental affect on ADMET predictions.Therefore, adjusting the fragments frequency emerges as an important hyperparameter that requires tuning before model training.
From our experimentation, the need for learning rate optimization is clear, and further tuning could yield substantial improvements in ADMET prediction accuracy.Due to the large computational cost of each experiment, and the number of experiments performed in this study, we did not tune the learning rate.However, we predict that for each differing frequency cutoff vocabulary, the learning rate must be tuned.In addition, we propose adjusting the masked language modelling approach to prioritize converting fragments into mask tokens and assigning them a higher weight during loss calculation.By doing so, the model should more effectively contextualize between fragments and SMILES tokens within our hybrid approach.Given the limited efficiency of the masked language modelling strategy, where only 15% of tokens are used for prediction, we recommend exploring an encoder-decoder Transformer model and language modelling strategy.A full Transformer model learns to contextualize entire sequences at once, potentially addressing some of the limitations observed in our current approach.Last but not the least, the hybrid encoding idea is applicable to other line notation representation methods for molecules, such as SELFIES [10], along with suitable fragmentation techniques, and to a range of quantitative structure activity relationship (QSAR) predictive tasks.This generalization needs to be comprehensively studied as future work.

Appendix A: Comparison of ADMET modelling experimentation
The detailed results for the two-phase and one-phase strategies pre-training and then fine-tuning on ADMET tasks are given in Tables 4 and 5.

Fig. 5
Fig. 5 Comparison between two-phase fine-tuning experimentation and SMILES for a classification and b regression tasks, averaged among folds and executions

Fig. 6
Fig. 6 Comparison between one-phase fine-tuning experimentation and SMILES for a classification and b regression tasks, averaged among folds and executions

Table 1
Fine-tuning datasets a select few being prevalent.The figure further segments the cutoff thresholds with vertical lines and shows the proportion of fragments meeting or surpassing these values, emphasizing the rarity of most fragments. only

Table 3
Pre-training results on the testing set in terms of mean and standard deviation among five executions

Table 4
Comparison of ADMET modelling experimentation using two-phase pre-training

Table 4 ( continued )
Entries in boldface highlight the best mean performance in individual tasks

Table 5
Comparison of ADMET modelling experimentation using one-phase pre-training

Table 6
Comparison of ADMET prediction models on TDC ADMET group benchmarkEntries in boldface highlight the best mean performance in individual tasks